I, Paul Polakis, Ph.D., declare and say as follows:



1. I was awarded a Ph.D. by the Department of Biochemistry of the Michigan
State University in 1984. My scientific Curriculum Vitae is attached to and forms 
part of this Declaration (Exhibit A). 

2. I am currently employed by Genentech, Inc. where my job title is Staff 
Scientist. Since joining Genentech in 1999, one of my primary responsibilities has 
been leading Genentech's Tumor Antigen Project, which is a large research project 
with a primary focus on identifying tumor cell markers that find use as targets for 
both the diagnosis and treatment of cancer in humans. 

3. As part of the Tumor Antigen Project, my laboratory has been analyzing 
differential expression of various genes in tumor cells relative to normal cells. 
The purpose of this research is to identify proteins that are abundantly expressed 
on certain tumor cells and that are either (i) not expressed, or (ii) expressed at
lower levels, on corresponding normal cells. We call such differentially expressed 
proteins "tumor antigen proteins". When such a tumor antigen protein is 
identified, one can produce an antibody that recognizes and binds to that protein. 
Such an antibody finds use in the diagnosis of human cancer and may ultimately 
serve as an effective therapeutic in the treatment of human cancer. 

4. In the course of the research conducted by Genentech's Tumor Antigen 
Project, we have employed a variety of scientific techniques for detecting and 
studying differential gene expression in human tumor cells relative to normal cells, 
at genomic DNA, mRNA and protein levels. An important example of one such 
technique is the well known and widely used technique of microarray analysis 
which has proven to be extremely useful for the identification of mRNA molecules 
that are differentially expressed in one tissue or cell type relative to another. In the 
course of our research using microarray analysis, we have identified 
approximately 200 gene transcripts that are present in human tumor cells at 
significantly higher levels than in corresponding normal human cells. To date, we 
have generated antibodies that bind to about 30 of the tumor antigen proteins 
expressed from these differentially expressed gene transcripts and have used these 
antibodies to quantitatively determine the level of production of these tumor 
antigen proteins in both human cancer cells and corresponding normal cells. We
have then compared the levels of mRNA and protein in both the tumor and normal 
cells analyzed. 



5. From the mRNA and protein expression analyses described in paragraph 4 
above, we have observed that there is a strong correlation between changes in the 
level of mRNA present in any particular cell type and the level of protein
expressed from that mRNA in that cell type. In approximately 80% of our
observations we have found that increases in the level of a particular mRNA
correlate with changes in the level of protein expressed from that mRNA when
human tumor cells are compared with their corresponding normal cells.

6. Based upon my own experience accumulated in more than 20 years of 
research, including the data discussed in paragraphs 4 and 5 above and my 
knowledge of the relevant scientific literature, it is my considered scientific 
opinion that for human genes, an increased level of mRNA in a tumor cell relative 
to a normal cell typically correlates to a similar increase in abundance of the 
encoded protein in the tumor cell relative to the normal cell. In fact, it remains a 
central dogma in molecular biology that increased mRNA levels are predictive of 
corresponding increased levels of the encoded protein. While there have been 
published reports of genes for which such a correlation does not exist, it is my 
opinion that such reports are exceptions to the commonly understood general rule 
that increased mRNA levels are predictive of corresponding increased levels of the 
encoded protein. 

7. I hereby declare that all statements made herein of my own knowledge are 
true and that all statements made on information or belief are believed to be true, 
and further that these statements were made with the knowledge that willful false 
statements and the like so made are punishable by fine or imprisonment, or both, 
under Section 1001 of Title 18 of the United States Code and that such willful 
statements may jeopardize the validity of the application or any patent issued
thereon.
Summary

Mitochondria are tailored to meet the metabolic and 
signaling needs of each cell. To explore their molecular
composition, we performed a proteomic survey of mi- 
tochondria from mouse brain, heart, kidney, and liver 
and combined the results with existing gene annota- 
tions to produce a list of 591 mitochondrial proteins, 
including 163 proteins not previously associated with 
this organelle. The protein expression data were 
largely concordant with large-scale surveys of RNA 
abundance and both measures indicate tissue-spe- 
cific differences in organelle composition. RNA ex- 
pression profiles across tissues revealed networks of 
mitochondrial genes that share functional and regula- 
tory mechanisms. We also determined a larger "neigh- 
borhood" of genes whose expression is closely cor- 
related to the mitochondrial genes. The combined 
analysis identifies specific genes of biological interest, 
such as candidates for mtDNA repair enzymes, offers 
new insights into the biogenesis and ancestry of mam- 
malian mitochondria, and provides a framework for 
understanding the organelle's contribution to human 
disease. 

Introduction 

Mammalian mitochondria are ubiquitous organelles responsible
for 90% of ATP production in respiring cells.
They are best known for housing the oxidative phosphorylation
(OXPHOS) machinery as well as enzymes
needed for free fatty acid metabolism and the Krebs
cycle. Key steps of heme biosynthesis, ketone body
generation, and hormone synthesis also reside within
this organelle (Stryer, 1988). The mitochondrion generates
the majority of cellular reactive oxygen species
(ROS) and has specialized scavenging systems to protect
itself and the cell from these toxic by-products.
Furthermore, the organelle is crucial for cellular calcium
signaling and hosts key machinery for programmed cell
death, serving as a gatekeeper for apoptosis (Hockenbery
et al., 1993; Kluck et al., 1997). Given its contribution
to cellular physiology, it is not surprising that this organelle
can play an important role in human disease, such
as diabetes, obesity, cancer, aging, neurodegeneration,
and cardiomyopathy (Wallace, 1999).

*Correspondence: lander@genome.wi.mit.edu (E.S.L.), mann@bmb.sdu.dk (M.M.)

6 These authors contributed equally to this work.

Mitochondria contain their own DNA (mtDNA) which 
is a compact genome encoding only 13 polypeptides. 
Through reductive evolution, the complement of genes 
constituting the original eubacterial predecessors of 
modern-day mitochondria have been either lost or trans- 
ferred from mtDNA to the nuclear genome (Andersson 
et al., 1998). Through an expansive process, the mito- 
chondrion has also acquired new proteins and function- 
ality. The exact number of mammalian mitochondrial 
proteins is not currently known, but estimates based 
on comparisons to the closest eubacterial relative of 
mammalian mitochondria, Rickettsia prowazekii (Andersson
et al., 1998), comparisons to Saccharomyces
cerevisiae (Kumar et al., 2002), and two-dimensional gel
electrophoresis studies of isolated mammalian mitochondria
(Lopez et al., 2000; Rabilloud et al., 1998) suggest
that the organelle contains approximately 1200
proteins. Although several recent studies have utilized 
proteomic and genetic approaches to expand the inven- 
tory of mammalian mitochondrial proteins, only 600-700 
mitochondrial proteins are currently known (Da Cruz et 
al., 2003; Lopez et al., 2000; Ozawa et al., 2003; Taylor 
et al., 2003; Westermann and Neupert, 2003). 

Classic electron microscopy studies have demon- 
strated morphologic differences in mitochondria from 
different cell types (Ghadially, 1997). Moreover, the mitochondria
capacity, mtDNA copy number, enzymatic stoichiometry,
carbon substrate utilization patterns, and
biosynthetic pathways can be specialized (Stryer, 1988;
Veltri et al., 1990; Vijayasarathy et al., 1998). Despite this
apparent physiologic diversity, little is known about the 
molecular basis for these differences and the extent to 
which mitochondrial composition varies across different 
tissues. Emerging proteomics and genomics technolo- 
gies afford the opportunity to survey these properties. 
Here, we use mass-spectrometry-based proteomics 
(Aebersold and Mann, 2003) to profile mitochondrial 
composition across four different tissues, and we then 
correlate and extend the proteomic results with mRNA 
expression data across a much larger set of tissues. 

Our proteomic and RNA expression profiling study 
identified hundreds of gene products that are either lo- 
calized to this organelle or tightly coregulated with mito- 
chondrial genes, providing new hypotheses about the 
molecular composition of this organelle and how networks
of genes may confer specialized function to mitochondria.
The survey serves as an initial step toward
elucidating the transcriptional, translational, and protein
targeting mechanisms likely operative in achieving tissue-specific
differences in mitochondrial form and
function.

Results and Discussion 

Proteomic Survey of Mouse Mitochondria 
We carried out a systematic survey of mitochondrial 
proteins from brain, heart, kidney, and liver of C57BL/6J
mice (see Experimental Procedures). Each of these 
tissues provides a rich source of mitochondria. The iso- 
lation consisted of density centrifugation followed by 
Percoll purification. To assess the purity of our preparations,
we performed immunoblot analysis of the organelles
using markers for mitochondria as well as for contaminating
organelles (Supplemental Figure S1 available
online at http://www.cell.com/cgi/content/full/115/5/629/DC1).
As can be seen in Supplemental Figure S1,
the liver, heart, and kidney preparations were highly en- 
riched in mitochondria, while brain preparations tended 
to show persistent contamination by synaptosomes, 
which themselves are a rich source of neuronal mitochondria
(see Fernandez-Vizarra et al., 2002).

Mitochondrial proteins from each tissue were solubilized
and size separated by gel filtration chromatography
into a batch of approximately 15-20 fractions (see
Experimental Procedures). These proteins were then di- 
gested and analyzed by liquid chromatography mass 
spectrometry/mass spectrometry (LC-MS/MS). More than 
100 LC-MS/MS experiments were performed (see Ex- 
perimental Procedures). 

The acquired tandem mass spectra were then searched 
against the NCBI nonredundant database (October 2002) 
consisting of mammalian proteins using a probability- 
based method (Perkins et al., 1999). We used stringent 
criteria for accepting a database hit. Specifically, only 
peptides corresponding to complete tryptic cleavage 
specificity with scores greater than 25 were considered 
(see Experimental Procedures). Furthermore, we only 
accepted fragmentation spectra which also exhibited a 
correct, corresponding peptide sequence tag (Mann and 
Wilm, 1994) consisting of at least three amino acids. 
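
The three acceptance criteria above can be read as a simple conjunctive filter. The following is a minimal sketch only: the `PeptideHit` record, its field names, and the example peptides are hypothetical, and the actual search-engine output format is not described in this section.

```python
from dataclasses import dataclass


@dataclass
class PeptideHit:
    # Hypothetical record for one database hit; field names are assumptions.
    sequence: str
    score: float          # probability-based search score
    fully_tryptic: bool   # consistent with complete tryptic cleavage
    tag_length: int       # residues in the matching peptide sequence tag


def accept(hit: PeptideHit, min_score: float = 25.0, min_tag: int = 3) -> bool:
    """Return True only if the hit passes all three criteria in the text."""
    return (
        hit.fully_tryptic
        and hit.score > min_score      # strictly greater than 25
        and hit.tag_length >= min_tag  # at least three amino acids
    )


hits = [
    PeptideHit("LVNELTEFAK", 41.0, True, 4),
    PeptideHit("AEFVEVTK", 25.0, True, 3),   # score is not greater than 25
    PeptideHit("YLYEIAR", 38.0, False, 5),   # not fully tryptic
    PeptideHit("DLGEEHFK", 30.0, True, 2),   # sequence tag too short
]
accepted = [h.sequence for h in hits if accept(h)]
```

Requiring all three conditions at once trades sensitivity for specificity, which matches the stated aim of using stringent criteria.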

Using these criteria, 4766 proteins were identified. 
This list contains a high degree of redundancy, because 
a protein may have been found in adjacent gel-filtration 
fractions and in different tissues, and may correspond 
to different database entries corresponding to nearly 
identical proteins which have not been distinguished by 
mass spectrometry. To produce a nonredundant list of 
identified proteins, we used a permissive clustering rou- 
tine (see Experimental Procedures) based on BLAST 
(Altschul et al., 1990) to collapse the 4766 hits to a 
distinct set of 399 mouse protein clusters (see Figure 
1A; Supplemental Table S1 available at above URL). 
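
One way to picture the collapse of redundant hits into clusters is single-linkage grouping over pairs of hits judged to be near-identical. The sketch below assumes such pairs have already been produced (for example from BLAST comparisons); the union-find approach, the function name, and the hit identifiers are illustrative assumptions, not the routine described in Experimental Procedures.

```python
def cluster(ids, similar_pairs):
    """Group ids into clusters by single linkage over the given pairs."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk to the cluster representative, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical hit identifiers and near-identity pairs.
hit_ids = ["hitA", "hitB", "hitC", "hitD", "hitE"]
pairs = [("hitA", "hitB"), ("hitB", "hitC"), ("hitD", "hitE")]
protein_clusters = cluster(hit_ids, pairs)
```

Single linkage is permissive in the sense that two hits end up in the same cluster whenever a chain of near-identical intermediates connects them.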

Previously Annotated Mitochondrial Proteins 
We created a list of previously annotated mouse and 
human mitochondrial proteins by pooling all the mouse 
and human proteins from MITOchondria Project (MITOP,
http://mips.gsf.de/proj/medgen/mitop/), a public database
of curated mitochondrial proteins, as well as all
proteins annotated as mitochondrial in NCBI's LocusLink 
database (http://www.ncbi.nlm.nih.gov/LocusLink/) (Jan- 
uary 2003, see Experimental Procedures). After elimina- 
tion of redundancy, the list contains 428 distinct mouse 
proteins that are either directly annotated as mitochon- 
drial or whose human homolog is annotated as mito- 
chondrial (Figure 1A). We did not include the human 
proteins recently reported to be mitochondrial by Taylor 
et al. (2003) in a study published after the construction of 
our list of previously annotated proteins. These proteins 
instead serve as a control against which to compare the 
proteins identified in our proteomic analysis. The list of 
428 previously annotated mitochondrial proteins is by 
no means comprehensive; many mitochondrial proteins
are simply not cataloged in these public databases.
However, it does provide a reasonable, high confidence 
list of previously annotated proteins against which to 
benchmark our proteomic survey. 

Newly Identified Mitochondrial Proteins 
The set of 399 proteins identified in our proteomic survey
includes 236 of the 428 proteins previously annotated to
be mitochondrial (55%) and 163 proteins not previously
annotated as associated with this organelle (Figure
1A). Combining the previous and new sets, we obtain a
list of 591 mitochondria-associated proteins (mito-A)
that are physically associated with this organelle (Supplemental
Table S1).
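
The counts in this paragraph follow from simple set arithmetic, which the short check below reproduces. Only the three input numbers come from the text; the variable names are illustrative.

```python
annotated = 428   # previously annotated mitochondrial proteins
survey = 399      # protein clusters identified in the proteomic survey
overlap = 236     # survey proteins that were already annotated

newly_identified = survey - overlap     # in the survey but not annotated
mito_a = annotated + newly_identified   # union of the two sets
coverage = overlap / annotated          # fraction of annotated proteins detected

print(newly_identified, mito_a, f"{coverage:.0%}")
```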

The 399 proteins identified in the proteomic survey 
span a wide range of molecular weight and isoelectric 
points (Figures 1B and 1C), although proteins from the
inner mitochondrial membrane are underrepresented 
(Figure 1D). The incomplete coverage (55%) is most 
likely due to the finite sensitivity of the mass spectromet- 
ric methodology, which acts as a bias against proteins 
of low abundance. This explanation is supported by 
analysis of RNA expression of the genes encoding the 
detected and undetected proteins. Considering the sub- 
set of the 428 previously annotated proteins for which 
RNA expression was reported in an atlas of mRNA ex- 
pression in mouse (Su et al., 2002), the distribution of 
RNA expression level was about 5-fold higher for the 
genes whose products were detected in our proteomic 
survey as compared to those that were not (p = 1 × 10⁻²¹)
(Figure 1E). This suggests that the proteomics
strategy preferentially detected the higher abundance 
proteins. 

The 163 proteins not previously annotated as mito- 
chondrial potentially represent new mitochondrial pro- 
teins, either in the conventional sense of being present 
within the organelle or in a broader sense of being teth- 
ered to the mitochondrial outer membrane. To test this 
notion, we sought independent evidence that these 163 
proteins are actually mitochondrial (Supplemental Table 
S2). First, we compared the list to proteins identified in 
a recent survey of human heart mitochondria (Taylor et 
al., 2003). Human homologs of 88 of the 163 proteins 
were identified in this recently published study. Of the 
remaining 75 proteins, 19 (25%) have strong mitochondrial
targeting sequences based on bioinformatic analysis
of protein targeting sequences (see Supplemental
Table S2 and Experimental Procedures) (Nakai and Horton,
1999), a proportion slightly lower than the known
mitochondrial proteins (38%). Of those that do not have
strong mitochondrial targeting sequences, seven show
RNA expression patterns tightly correlated with known
mitochondrial genes. For example, polymerase delta interacting
protein 38 (encoded by Pdip38), which was
detected only in liver mitochondria, and the gene product
of Rnaseh1, which was found only in the kidney,
have strong mitochondrial targeting scores. The protein
2010100O12Rik, which was detected in mouse liver and
in kidney, appears to be an integral membrane protein
whose gene expression is extremely tightly correlated
with the known mitochondrial genes. Hence, the majority
of the 163 newly identified mito-A members have
multiple tiers of evidence supporting that they are mitochondrial.

Figure 1. Previously Known and Newly Identified Mitochondrial Proteins (mito-A)

(A) Proteomic survey of mitochondria from mouse brain, heart, kidney, and liver resulted in the identification of 399 protein clusters, 236 of
which were previously annotated as being mitochondrial. The distributions for (B) molecular weight, (C) isoelectric point, and (D) mitochondrial
compartments are plotted for proteins detected (red) or not detected (blue) by our proteomic survey. Isoelectric point, molecular weight, and
subcellular distribution data came from the MITOchondria Project (MITOP [Scharfe et al., 2000]). (E) Cumulative distribution of mRNA abundance
for those genes whose protein product was detected (red) or not detected (blue) by proteomics. The median expression levels for both groups
are indicated. MIM, mitochondrial inner membrane; IMS, intermembrane space; and MOM, mitochondrial outer membrane.

Figure 2. Subcellular Localization of Proteins Not Previously Associated with the Mitochondrion

GFP fusion proteins for human homologs of five of the newly identified proteins (A-E) were expressed in human 293 cells, counterstained
with an antibody (α-GRP-75) directed against a known mitochondrial marker (F-J), and imaged by confocal microscopy. Panels (K)-(O) show
the overlay of the two images. (A) UK114 (translational inhibitor protein p14.5), homolog of Hrsp12 (GenPept accession 6680277). (B) HINT2
(histidine triad nucleotide binding protein 2), homolog of 1190005L05Rik (GenPept accession 12835711). (C) FLJ14668 (hypothetical protein),
homolog of 2010309E21Rik (GenPept accession 13385042). (D) YF13H12 (protein expressed in thyroid), homolog of 0610025L15Rik (GenPept
accession 12963539). (E) NIT2 (Nit protein 2), homolog of Nit2 (GenPept accession 12963555).

To provide direct experimental evidence, we chose 
human homologs of five of the 163 newly identified 
mouse mito-A proteins and created GFP-tagged fusions 
to determine their subcellular localization by confocal 
microscopy (Figure 2). Four of these five showed exclu- 
sive mitochondrial staining, while one showed diffuse 
mitochondrial and cytosolic staining. Taken together, 
our analyses show that of the 163 mito-A proteins, 113 
have at least one additional tier of support (Supplemental 
Table S2), suggesting that the list of newly identified pro- 
teins is indeed highly enriched in mitochondrial proteins. 

The list of 163 proteins above includes many proteins 
of unknown function (Supplemental Table S2). For ex- 
ample, very little is known about the five proteins whose 
localization we confirmed. NIT2 (Figure 2E) and HINT2 
(Figure 2B), human homologs of proteins we identified, 
are both evolutionarily conserved enzymes with putative
roles in nucleotide metabolism and possibly in tumor
suppression (Brenner et al., 1999). UK114 (Figure 2A) is 
the human homolog of mouse protein Hrsp12, previously
described as a liver protein that occurs as a dimer and
is differentially expressed following heat shock (Samuel
et al., 1997). YF13H12 (Figure 2D) and FLJ14668 (Figure
2C) are human homologs of mouse proteins we identi- 
fied that also appear to be exclusively mitochondrial 
based on microscopy studies. Other proteins identified 
in our study are poorly characterized, but based on their 
protein domains, could play very interesting roles in the 
mitochondrion. For example, the AAA-ATPase domain 
containing protein Tob3 may play a role in the assembly 
or degradation of mitochondrial protein complexes (Lu- 
pas and Martin, 2002). This list also includes a number of 
well-characterized proteins not traditionally associated 
with the organelle, including the glycolytic enzymes hex- 
okinase, aldolase, and glyceraldehyde 3 phosphate de- 
hydrogenase. Previous studies have suggested that 
these enzymes may be tethered to outer mitochondrial 
proteins, and several other recent proteomics studies 
have detected these proteins in their mitochondrial 
preparations (Taylor et al., 2003). Close proximity of this 
glycolytic machinery to the outer membrane of the mito- 
chondrion would serve an obvious biological function, 
since it produces pyruvate, which feeds into the Krebs
cycle in the mitochondrion. Our list also includes several 
proteins traditionally associated with the lysosome (e.g., 
cathepsin and saposin), which may play a role in mito- 
chondrial protein degradation. However, it is possible 
that these latter proteins merely represent contamina- 
tion by other organelles. 

Human homologs of two proteins identified by the 
proteomic survey are clearly involved in human disease. 
The first is LETM1, which is deleted in nearly all patients
with Wolf-Hirschhorn syndrome (WHS) (Endele et al., 
1999). We identified this protein in mouse brain, heart, 
kidney, and in liver, and a recent study confirmed its 
mitochondrial localization (Taylor et al., 2003). The sec- 
ond is LRPPRC, encoding an mRNA binding protein, 
whose human homolog we recently identified as being 
mutated in a human cytochrome c oxidase deficiency, 
Leigh Syndrome, French Canadian variant (Mootha et 
al., 2003a). 

Clearly, additional studies are needed to fully validate 
the subcellular localization of all these 163 proteins 
(Supplemental Table S2) and to determine their function. 
While several bioinformatic tools are currently available 
for detecting mitochondrial targeting sequences (Nakai 
and Horton, 1999; Neupert, 1997), such predictions still 
suffer from poor sensitivity and specificity. With the grow- 
ing inventory of mito-A proteins, it may be possible to 
discover new protein targeting motifs and mechanisms. 

Concordance of mRNA Abundance 
and Protein Detection 

Next, we sought to determine whether protein detection 
in our proteomics experiments is broadly concordant 
with mRNA abundance of the corresponding gene mea- 
sured by oligonucleotide microarrays. The traditional 
approach to relating mRNA abundance to protein abun- 
dance is to calculate a simple correlation coefficient. 
However, protein detection by mass spectrometry and 
RNA expression analysis with microarrays can result in 
noisy data. For example, the protein product of a given 
gene may give rise to few or unfavorable tryptic peptides 
for mass spectrometric identification. Similarly, the oli- 
gonucleotide probes on the microarray may be imper- 
fect detectors for certain genes. Previous efforts to ana- 
lyze such noisy data with simple correlation analyses 
have resulted in positive but weak associations (Griffin 
et al., 2002; Lian et al., 2001) between mRNA and protein, 
while analyses with more robust statistics have yielded 
stronger correlations (Gygi et al., 1999). 

To decrease the effect of noisy data, we developed 
an RNA/protein concordance test that takes advantage 
of the availability of mRNA and protein measures across 
four tissues (see Experimental Procedures). If a given 
protein is detected in liver but not in heart, for example, 
we say that the mRNA abundance is concordant if the 
mRNA expression level in liver exceeds that in heart. 
The mRNA/protein concordance test overcomes those
technical artifacts that are uniform for a given gene 
across different tissues. For a given gene, we can count 
the total number of concordant measures for all pairs 
of tissues and compare to the expected distribution 
of concordance in the null case in which there is no 
association between mRNA and protein detection (see 
Experimental Procedures). 
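
The pairwise rule described above can be sketched for a single gene as follows. This is an illustration under stated assumptions: tissue pairs in which detection does not differ are treated as uninformative and skipped, and the function name, data layout, and expression values are hypothetical. The exact procedure and its null distribution are given in Experimental Procedures and are not reproduced here.

```python
from itertools import combinations


def concordance(detected, mrna):
    """Count (concordant, informative) tissue pairs for one gene.

    detected: dict mapping tissue -> True/False (protein detected or not)
    mrna:     dict mapping tissue -> mRNA expression level
    """
    concordant = informative = 0
    for a, b in combinations(sorted(detected), 2):
        if detected[a] == detected[b]:
            continue  # assumption: pairs with equal detection carry no information
        present, absent = (a, b) if detected[a] else (b, a)
        informative += 1
        if mrna[present] > mrna[absent]:
            concordant += 1  # mRNA is higher where the protein was detected
    return concordant, informative


# Hypothetical values for one gene across the four tissues.
detected = {"brain": False, "heart": False, "kidney": True, "liver": True}
mrna = {"brain": 120, "heart": 300, "kidney": 250, "liver": 900}
result = concordance(detected, mrna)
```

In this toy case the heart/kidney pair is discordant, because the protein was detected in kidney although the mRNA level is higher in heart.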

We applied this analysis to proteins identified in well- 
matched brain, heart, kidney, and liver batches for which 
we also had mRNA expression measures. We found that 
426 of the 569 pairwise comparisons were concordant, 
allowing us to strongly reject the null hypothesis that 
there is no association between protein detection and 
mRNA abundance (p = 3.0 × 10⁻¹⁴). Hence, on a bulk
level, mRNA expression levels are indeed correlated to
detection by proteomics. The fully discordant cases may
represent genes whose mRNA and protein products are 
regulated via posttranscriptional mechanisms (Klausner 
and Harford, 1989), although some may reflect noise in 
the measurements. 
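
For orientation, the concordant fraction and a naive significance estimate can be computed as below. This sketch assumes a simple sign test in which every comparison is independent and concordant with probability 0.5 under the null. That is not the null distribution used in the study (see Experimental Procedures), so the resulting value is not expected to match the reported p value.

```python
from math import comb


def binomial_upper_tail(successes, trials):
    """Exact P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favourable / 2 ** trials


concordant, total = 426, 569           # counts reported in the text
fraction = concordant / total          # share of concordant comparisons
p_sign_test = binomial_upper_tail(concordant, total)  # assumption: sign-test null

print(f"{fraction:.1%} concordant")
```

Roughly three quarters of the comparisons are concordant, well above the one half expected if protein detection and mRNA abundance were unrelated.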

Abundance Differences across Tissues 
We next sought to investigate the degree to which mito- 
A transcripts and proteins exhibit compositional differ- 
ences across tissues. Of course, apparent absence of 
a gene product in these experiments cannot be distin- 
guished from a very low level of expression, so these 
surveys should be interpreted as revealing differences 
in the abundance of mitochondrial components across 
tissues. 

Of the 236 previously known mitochondrial proteins 
that were detected in our proteomic survey (Figure 1A),
about 40% were detected in all four tissues. mRNA expression
measures were available for 168 of these genes
(Su et al., 2002). Using a previously established criterion 
that a gene is "expressed" (see Experimental Proce- 
dures), we found that 57% of these genes were ex- 
pressed in all four tissues. 

The fact that only about one-half of gene products 
are detected in all four tissues could reflect true differ- 
ences in the abundance of these components or an 
artifact from random under-sampling of the tissues by 
our methodologies. To distinguish these possibilities, 
we considered five well-matched experimental tissue 
batches: two independent liver samples and one sample 
from each of brain, heart, and kidney. We then computed 
the conditional probability that a protein detected in the 
first liver sample is also detected in a specified one of 
the other samples. The conditional probability of de- 
tecting the protein in another sample is 92% for the 
second liver sample (indicating good, although not per- 
fect reproducibility) but averages only 79% for brain, 
heart and kidney. The probability of detection in a dis- 
tinct tissue is therefore ~85% as large as the probability 
of redetection in the same tissue. The diversity of mito- 
chondrial protein composition across different tissues 
is thus substantially greater than can be accounted for 
by experimental noise alone, indicating that there are 
differences in protein composition between the tissues. 
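
The conditional probabilities in this comparison are overlaps between sets of detected proteins, as the sketch below illustrates. The protein identifiers and set contents are invented for illustration; only the 92% and 79% figures come from the text.

```python
def redetection_probability(reference, other):
    """P(protein detected in `other` | protein detected in `reference`)."""
    if not reference:
        raise ValueError("reference sample contains no detected proteins")
    return len(reference & other) / len(reference)


# Hypothetical detection sets for three samples.
liver_1 = {"A", "B", "C", "D", "E"}
liver_2 = {"A", "B", "C", "D", "F"}
heart = {"A", "B", "C", "G"}

same_tissue = redetection_probability(liver_1, liver_2)
other_tissue = redetection_probability(liver_1, heart)

# Ratio of the rounded percentages reported in the text.
reported_ratio = 0.79 / 0.92
```

The ratio of the rounded percentages is about 0.86, in line with the approximate figure quoted above.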

We therefore sought to model the degree to which
mito-A transcripts and proteins are shared across different
tissues. We can define P_i as the probability that a
given protein is found in a set of i + 1 tissues, conditional
on being found in a specific set of i tissues, averaged
over all distinct subsets of tissues (see Experimental
Procedures). Focusing on protein expression in four
well-matched tissue batches, we find that P_1 = 0.79,
P_2 = 0.89, and P_3 = 0.93. Likewise, using the RNA
expression data, we find P_1 = 0.89, P_2 = 0.93, and P_3 =
0.94. These results are broadly consistent with a simple
theoretical model in which half of the mitochondrial components
are present in all tissues and the other half
are tissue specific, occurring in a given
tissue with 50% probability. Out of a hundred mitochondrial
proteins, two tissues would then each contain the
50 ubiquitous mitochondrial proteins as well as 25 tissue-specific
proteins, of which half would be shared
(i.e., 62.5/75 or 83% of proteins shared). In this way, this
simple model would result in P_1 = 0.83, P_2 = 0.90, and
P_3 = 0.94 (see Experimental Procedures), very close to
the degree of protein and transcript sharing across
tissues.
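
The values predicted by the simple model can be reproduced in a few lines. The sketch assumes, as stated in the text, that half of the components are ubiquitous and that each remaining component occurs independently in any given tissue with probability 0.5. The function and parameter names are illustrative.

```python
def sharing_probability(i, ubiquitous=0.5, q=0.5):
    """P_i: chance a protein is in i + 1 tissues given it is in a specific set of i.

    ubiquitous: fraction of components present in every tissue
    q:          per-tissue occurrence probability of a tissue-specific component
    """
    specific = 1.0 - ubiquitous
    in_i = ubiquitous + specific * q ** i              # present in all i tissues
    in_i_plus_1 = ubiquitous + specific * q ** (i + 1)  # present in one more tissue
    return in_i_plus_1 / in_i


model = [sharing_probability(i) for i in (1, 2, 3)]
print([f"{p:.2f}" for p in model])
```

For i = 1 this is the 62.5/75 calculation from the text, expressed as fractions of the total protein pool.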

The notion that only a subset of mitochondrial proteins 
are shared (that is, present at detectable expression 
levels) among mitochondria from two different tissues 
is consistent with previous studies demonstrating mor- 
phological and functional specialization of this organ- 
elle. The consistency of RNA and protein expression 
analysis is important, since proteomics, but not RNA 
expression analysis, allows us to control for organelle 
copy number, which can vary across cell types. 

Subnetworks of Mitochondrial Genes 
Numerous studies have shown that functionally related 
sets of genes can often exhibit patterns of correlated 
gene expression (DeRisi et al., 1997). We were interested
in determining whether subsets of the 591 mito-A genes 
might exhibit distinct patterns of expression across dif- 
ferent tissues. For 386 of the 591 mito-A genes, mRNA 
expression measures were available in a mouse gene 
expression compendium containing data across 45 tissues
(Su et al., 2002).

We calculated pairwise correlations and performed 
hierarchical clustering of these 386 gene expression 
profiles (Figure 3). There are several striking mitochon- 
drial gene modules (Figure 3A), which we define as clus- 
ters of genes showing strong expression correlation 
across the 45 tissues (see Supplemental Table S3 for 
annotations of these genes). These modules include pre- 
viously known as well as newly identified members of 
mito-A (see bar labeling in Figure 3B). As shown in Figure 
3B, mitochondrial gene expression profiles vary tremen- 
dously from tissue to tissue, suggesting a regulatory 
diversity that is consistent with the compositional diver- 
sity noted above. 
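
The first step of this analysis, a pairwise correlation matrix over expression profiles, can be sketched as follows. The sketch assumes Pearson correlation; the gene names and expression values are invented, and the hierarchical clustering step, which the figure legend attributes to DCHIP, is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical profiles: one expression value per tissue.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
genes = sorted(profiles)
corr = {(g, h): pearson(profiles[g], profiles[h]) for g in genes for h in genes}
```

Genes whose profiles rise and fall together across tissues, such as geneA and geneB here, have correlations near 1 and would fall into the same module.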

Each of these gene modules is characterized by tightly 
correlated gene expression across the tissue compendium,
and some are heavily enriched in members of
well-known biochemical pathways. Members of these 
modules likely share transcriptional regulatory mecha- 
nisms as well as cellular functions. And because many 
of the newly identified mitochondrial genes (Figure 3B) 
lie within these modules, they provide an initial step
toward an understanding of their function. Of the 104
probe-sets corresponding to newly identified mitochondrial
proteins, 38 fall within one of these modules, providing
them with a preliminary functional context (Supplemental
Table S3).

Figure 3. Modules of Mitochondrial Genes
(A) Pairwise correlation matrix for the 386 mitochondrial genes represented
on the GNF mouse tissue compendium (Su et al., 2002).
Red represents strong positive correlation, blue represents strong
negative correlation. Dominant gene modules are labeled 1-6 with
annotations. (B) mRNA expression profile for 386 mitochondrial
genes (rows) across 45 different mouse tissues performed in duplicate
(columns) in the GNF mouse compendium. Genes and tissues
were hierarchically clustered and visualized using DCHIP (Schadt
et al., 2001). Selected tissues are labeled at the top of the panel.
Evidence that a gene encodes a mitochondrial protein is indicated
by the bars placed to the right of the correlogram: white, previously
annotated but not found in proteomics; gray, not previously annotated
but identified by proteomics; and black, previously annotated
and found in proteomics. Annotations of these 386 genes are available
in Supplemental Table S3 (available online at
http://www.cell.com/cgi/content/full/115/5/629/DC1).

Modules Enriched in Genes 
of Oxidative Phosphorylation 

Perhaps the most striking subnetwork of mitochondrial
genes is module 1, consisting of 90 genes that are related to
oxidative phosphorylation (OXPHOS), β-oxidation, and
the TCA cycle and are highly expressed in brown fat,
skeletal muscle, and heart (Figure 3B). This module includes
13 probe-sets corresponding to 12 newly identified
mito-A genes. Previous work has identified the bovine
homolog of one of these proteins, Grim19, as a
component of complex I of the electron transport chain
(Fearnley et al., 2001). The other proteins, which, to our
knowledge, have not been associated with oxidative
metabolism, include Usmg5, Np15, D10Ertd214e,
2010100O12Rik, 2610207I16Rik, 1110018B13Rik,
2610205H19Rik, 0610041L09Rik, 0610006O17Rik,
2310005O14Rik, and Gbas.

We recently showed that tightly correlated members 
of the OXPHOS biochemical pathway exhibit reduced 
gene expression in human diabetes (Mootha et al., 
2003b). It will be interesting to determine whether this 
property extends to this module, as well as what regula- 
tory mechanisms account for this striking pattern of 
correlated gene expression. 

Other Gene Modules 

Several of the other gene modules have clear functional 
associations. For example, module 2 contains 15 genes,
a large fraction of which are involved in branched chain 
amino acid metabolism. This module also contains two 
of the four known biotin-dependent carboxylases. These 
pathways are highly expressed in brown fat— but not 
skeletal muscle and heart— as well as in liver, kidney, 
adrenal, and testis, raising hypotheses about tissue ca- 
pacities for amino acid metabolism. 

It has long been known that adrenal mitochondria play 
a central role in steroidogenesis. Several of the enzymes 
involved in this pathway, including steroidogenic acute 
regulatory protein (Star), ferredoxin reductase, and ferredoxin,
are all found in module 3. Ferredoxin reductase
is the sole mammalian P450 NADPH reductase, transfer- 
ring electrons from NADPH, via ferredoxin, to choles- 
terol. Under substrate limiting conditions, it is known 
that electrons from this system can generate a large 
load of reactive oxygen species (ROS) that can be 
quenched by scavenging enzymes (Hwang et al., 2001). 
Interestingly, module 3 also includes the ROS scavenger 
peroxiredoxin 3, which may serve this function. Two 
known heat shock proteins, Hspe1 and Hspd1, are also
coordinately expressed in this module, though their role 
in steroid metabolism is not known. 

Module 6 includes genes involved in heme biosynthe- 
sis that form a tight cluster highly expressed in bone 
and in bone marrow. Of the four mitochondrial enzymes 
involved in heme biosynthesis (Stryer, 1988), aminolevu- 
linic acid synthetase, ferrochelatase, and coproporphy- 
rinogen oxidase are found within this module. Several 



genes encoding heme-containing proteins or involved 
with heme metabolism are also expressed in this cluster, 
as well as a newly identified mitochondrial protein, 
1110021D01Rik.

The mitochondrial modules represent a first step to- 
ward a systematic, functional characterization of mito- 
chondrial genes. The modules can be used for functional 
discovery as well as for discovering cis-elements
involved in organelle remodeling.

Mitochondrial Gene Expression Neighborhood 

The above studies focused on those genes whose prod- 
ucts are physically localized or associated with the mito- 
chondrion and attempted to characterize subnetworks 
within this group. We next sought to systematically iden- 
tify those genes that are coregulated with this set. We 
refer to this "mitochondrial neighborhood" as mito-CR, 
for mitochondria-co-regulated. The mito-CR set may 
contain genes not in the mito-A set and may encode 
proteins that are not physically associated with mito- 
chondria but which function coordinately with mitochon- 
drial processes. 

To define the mitochondrial neighborhood, we used
the neighborhood index (N100), a previously described
statistic that measures a given gene's expression similarity
to a target gene set (Mootha et al., 2003a). For
a given gene, the mitochondria neighborhood index is
defined as the number of mito-A genes among its nearest
100 expression neighbors. We applied neighborhood
analysis to all genes in the mouse expression atlas (Figure 4),
which includes a total of 10,043 genes, including
386 of the mito-A genes. We sought a threshold for N100
that would define the boundary of the neighborhood. We
found that an N100 value of at least 15 (see Experimental
Procedures) would be expected to occur by chance
approximately 1 in 20 times, after correcting for multiple
hypothesis testing (corresponding to a global p value
of ~0.05).

A total of 643 genes have N100 ≥ 15. We define this
as the expression neighborhood of the mito-A set, and 
we interpret these genes as being coregulated with mito- 
chondrial genes (see the entire rank ordered list in Sup- 
plemental Table S3). This group corresponds to only 
6.4% of all the genes studied, but it contains 45% of 
the mito-A genes (7-fold enrichment). The list includes 
48 that are newly mitochondrial based on our proteomic 
survey and 18 that were previously known to be mito- 
chondrial but not detected by our proteomic survey. 
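
The enrichment figures quoted above can be checked with a short calculation. This is only a sketch of the arithmetic, assuming that the number of mito-A genes inside the neighborhood is the 643 neighborhood genes minus the 470 non-mito-A genes reported in the next paragraph:

\[
\frac{643}{10{,}043} \approx 6.4\%, \qquad
\frac{643 - 470}{386} = \frac{173}{386} \approx 45\%, \qquad
\frac{0.45}{0.064} \approx 7 .
\]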

Importantly, the expression neighborhood mito-CR in- 
cludes 470 genes that are not present in the mito-A set 
itself. Some of these genes may encode proteins that 
are physically present in mitochondria but were missed 
in our proteomic survey, while others may encode pro- 
teins that are functionally related to mitochondria but 
not physically associated. The neighborhood mito-CR 
thus provides a catalog of genes that are likely function- 
ally relevant to mitochondrial biology and is complemen- 
tary to the proteomic approach that identified proteins 
resident in this organelle. 

Transcriptional Regulators within 
the Mitochondrial Neighborhood 

Because tissue-specific transcription factors are often
involved in specifying tissue differentiation, we reasoned
that the expression neighborhood might contain
genes encoding transcriptional regulators of organelle
biogenesis (Table 1). While none of these factors have
previously been shown to exhibit expression patterns
correlated with mitochondrial genes, several have previously
been implicated in mitochondrial biology. For
example, Pparγ, Pparα, and Esrra are nuclear receptors
that are involved in adipogenesis and fatty acid metabolism
and coactivated by PGC-1α, a regulator of mitochondrial
biogenesis (Puigserver and Spiegelman, 2003).
A number of other transcription factors, including Nfix,
Tbx6, and Klf9, exhibit patterns of correlated expression
that make them candidates for involvement in organelle
remodeling. Surprisingly, the nutrient sensor Sir2 is also
found within the mitochondrial expression neighborhood.
Sir2 encodes an NAD(+)-dependent histone deacetylase
involved in gene silencing, chromosomal stability,
and aging. Chromatin remodeling enzymes rely on
coenzymes derived from metabolic pathways, including
those generated by the mitochondrion. Our observations
suggest that Sir2 and mitochondrial gene expression
are coordinately regulated, providing a potential
regulatory link between the mitochondrion and the
nutrient sensing activities of Sir2.

Figure 4. Mitochondria Neighborhood Analysis
The mitochondria neighborhood index (N100)
is defined as the number of mito-A genes
that occur within the nearest 100 expression
neighbors of a given gene (Mootha et al.,
2003a). The distribution of N100 is plotted for
all genes (white), mito-A genes that are not
identified as ancestral (hashed), and for the
ancestral mito-A genes (black).

Table 1. Genes in the Mitochondria Expression Neighborhood with Putative Roles in DNA Maintenance and Repair

Gene name                                                           Gene symbol     N100

Transcriptional regulators
MyoD family inhibitor                                               Mdfi            63
nuclear factor I/X                                                  Nfix            60
zinc finger protein 288                                             Zfp288          56
T-box 6                                                             Tbx6            49
Cofactor required for Sp1 transcriptional activation subunit 2      Crsp2           47
RIKEN cDNA 9130025P16 gene                                          9130025P16Rik   46
Kruppel-like factor 9                                               Klf9            43
EGL nine homolog 1                                                  Egln1           39
Estrogen related receptor, alpha                                    Esrra           36
nuclease sensitive element binding protein 1                        Nsep1           34
sirtuin 1 (silent mating type information regulation 2, homolog) 1  Sirt1           31
peroxisome proliferator activated receptor alpha                    Ppara           29
metastasis associated 1-like 1                                      Mta1l1          28
NK2 transcription factor related, locus 5 (Drosophila)              Nkx2-5          27
cardiac responsive adriamycin protein                               Crap            24
homeo box D8                                                        Hoxd8           21
nuclear receptor subfamily 1, group I, member 2                     Nr1i2           21
nuclear receptor subfamily 1, group H, member 3                     Nr1h3           20
cellular nucleic acid binding protein                               Cnbp            19
transcription factor 2                                              Tcf2            19
Ets2 repressor factor                                               Erf             19
nuclear receptor subfamily 5, group A, member 1                     Nr5a1           18
nuclear factor, erythroid derived 2,-like 1                         Nfe2l1          18
zinc finger protein 30                                              Zfp30           17
peroxisome proliferator activated receptor gamma                    Pparg           17
cAMP responsive element binding protein 1                           Creb1           16
SRY-box containing gene 6                                           Sox6            15
CCAAT/enhancer binding protein (C/EBP), alpha                       Cebpa           15

DNA repair
mutL homolog 1                                                      Mlh1            29
mutS homolog 5                                                      Msh5            24
excision repair cross-complementing rodent repair deficiency,       Ercc1           15
  complementation group 1

DNA Repair Enzymes within the 
Mitochondrial Neighborhood 

Identifying proteins involved in mtDNA repair has been 
extremely challenging. These proteins are believed to 
occur in low abundance, and when found in mitochon- 
drial preparations, it is difficult to preclude the possibility 
of nuclear contamination. Although mtDNA mismatch 
repair activity has been reported in human cells (Mason 
et al., 2003), a mammalian mtDNA mismatch repair en- 
zyme has not yet been identified. This has been puzzling, 
since yeast mitochondria have a mutS homolog (Chi and 
Kolodner, 1994). 

The mitochondria expression neighborhood contains 
genes encoding two mammalian mismatch repair enzymes,
Msh5 and Mlh1. Msh5, a mammalian MutS homolog,
has previously been described to be required
for meiotic progression (Edelmann et al., 1999), but no
association with mitochondria has been noted. Interestingly,
Mlh1, a mammalian MutL homolog, has previously
been reported to be involved in repair of DNA following 
oxidative stress (Hardman et al., 2001). Supporting the 
notion that Msh5 and Mlh1 function in mitochondria, we 
find evidence by bioinformatic analysis (Claros, 1995; 
Nakai and Horton, 1999) that these two proteins contain 
reasonable mitochondrial targeting sequences. 

While our findings by no means prove that these en- 
zymes are involved in mtDNA repair, their strong corre- 
lated expression with mitochondrial genes and their 
mitochondrial targeting sequences make them very at- 
tractive candidates for mediating these repair activities. 

Mitochondrial Gene History 

The dual origin hypothesis suggests that the modern
mitochondrial proteome can be divided into two groups:
proteins derived from the organelle's eubacterial ancestry,
and the remaining proteins, which have been acquired
over the last 2 billion years (Andersson et al., 1998;
Karlberg et al., 2000). We consider a gene to be an
ancestral mitochondrial gene if it has a detectable or- 
tholog in Rickettsia prowazekii, the nearest eubacterial 
relative to mammalian mitochondria (Andersson et al., 
1998). Of the mito-A genes for which we had gene ex- 
pression measures, 54 can be identified as being ances- 
tral (see Experimental Procedures). We find that the an- 
cestral mitochondrial genes tend to have higher local 
enrichment by mitochondrial genes, as assayed by the 
neighborhood index (Figure 4). Interestingly, previous 
studies have suggested that mRNA populations encod- 
ing ancestral mitochondrial proteins tend to be trans- 
lated at polysomes associated with the mitochondrial 



outer membrane (Marc et al., 2002). The current result 
(Figure 4) hints that ancestral mitochondrial genes may 
exhibit a pattern of gene expression distinct from the 
other mitochondrial proteins, hence providing an addi- 
tional signature of their history. 

Conclusion 

We have performed a large-scale proteomic survey of 
mitochondria purified from four different mouse tissues 
and have analyzed the results in the context of existing 
annotations and publicly available gene expression pro- 
files. Integration of these datasets provides a first step 
toward a functional annotation of these newly identified 
proteins as well as an understanding of the regulatory 
organization of all mitochondrial genes. 

Our proteomic analysis is best thought of as a survey 
of the abundant mitochondrial proteins. Clearly, the 
mito-A list is incomplete. Based on comparisons to pre- 
viously known mitochondrial genes, our proteomic sur- 
vey appears to have a sensitivity of 55% and thus would 
be predicted to have missed 133 novel mitochondrial 
proteins; this would suggest that the true number of 
mito-A genes is at least 725. Because the well annotated 
proteins likely represent the more abundant proteins, 
amenable to analysis by traditional biochemical ap- 
proaches, this estimate likely represents a lower bound 
on the mitochondrial proteome. Future proteomic surveys
of the mitochondrion aimed at expanding the inventory
of mitochondrial proteins may benefit from higher
dimensional chromatography and improved sample
preparation, more sensitive and quantitative mass spectrometry
technologies (Aebersold and Mann, 2003), and
perhaps use of genetic strategies (Ozawa et al., 2003).
When combined with genome-wide expression microarrays,
it should be possible to more comprehensively
reconstruct pathways within the mitochondrion and to 
determine the extent to which mitochondrial diversity 
extends to other cell types and to lower abundance 
gene products. 

Proteomics and RNA expression profiling provide 
complementary insights. The mito-A list consists of 591 
genes whose products reside in or in close association 
with the mitochondrion, while the mitochondrial expres- 
sion neighborhood includes a large group of 643 genes 
whose transcription profiles are tightly correlated to 
those of mito-A. The expression neighborhood mito-CR 
contains a large fraction of the mito-A genes assayed 
in the expression survey (including some that had not 
been detected in the proteomic survey), as well as many 
additional genes. Some of these additional genes may 
encode products that actually reside in the mitochon- 
dria, while others may encode products that reside else- 
where but are related to mitochondrial biogenesis and 
function. In the future, it will be valuable to combine 
insights from complementary approaches, as sensitivity 
and specificity measures can be improved by combining 
different sources of experimental evidence. 

At present, the mechanisms that achieve cell-type- 
specific differences in mitochondrial form and function 
are not known. How a mitochondrion remodels in re- 
sponse to changes in nutrient status and energy de- 
mands or in disease states, such as cancer and diabe- 



tes, is poorly understood. It is likely that transcriptional 
mechanisms work in concert with mRNA processing 
and protein-targeting mechanisms to carefully achieve
appropriate enzymatic stoichiometries required for each
mitochondrion. Deciphering these mechanisms is an important
challenge. Mitochondrial modules serve as an
excellent starting point for identifying important cis-regulatory
elements, and the genes whose protein and RNA
expression levels are discordant may guide the identifi- 
cation of new posttranscriptional regulatory mecha- 
nisms. Finally, an expanded list of mitochondrial pro- 
teins may assist in identifying new organelle targeting 
sequences. 

Given the central role of the mitochondrion in the life 
and death of the cell, it is likely that the mitochondria- 
associated genes and those in the expression neighbor- 
hood represent a rich source of candidate genes for 
human disease as well as targets for future drug devel- 
opment. Such therapies may exploit the apparent com- 
positional and regulatory diversity within this organelle 
to provide treatment specificity for pathways operative 
in human disease. 

Experimental Procedures 

Organelle Purification and Sample Preparation 
Six- to eight-week-old male mice were subjected to an 8 hr fast and 
then euthanized. Brain, heart, kidney, and livers were harvested 
immediately and placed in ice-cold saline. Mitochondria were iso- 
lated using differential centrifugation as previously described and 
purified with a Percoll gradient (Mootha et al., 2003a). To test the 
purity of these preparations, we performed Western blot analysis 
as previously described, using antibodies directed against known 
mitochondrial proteins (cytochrome c, COXIV, and VDAC) as well 
as antibodies directed against calreticulin (a marker for the endo- 
plasmic reticulum) and for SNAP25 (a marker for synaptosomes). 
The proteins were then solubilized, size separated, and digested as 
previously described (Mootha et al., 2003a). 

Tandem Mass Spectrometry 

Liquid chromatography tandem mass spectrometry (LC-MS/MS)
was performed on QSTAR pulsar quadrupole time of flight mass 
spectrometers (AB/MDS Sciex, Toronto) as described previously 
(Mootha et al., 2003a). Tandem mass spectra were searched against 
the NCBInr database (October 2002) with tryptic constraints and 
initial mass tolerances <0.13 Da in the search software Mascot 
(Matrix Sciences, London). Only peptides achieving a Mascot score 
above 25 and containing a sequence tag of at least three consecu- 
tive amino acids were accepted. 

Curation of Previously Annotated Mitochondrial Proteins 
We used two key sources to identify previously annotated proteins. 
First, we downloaded the human and mouse protein sequences at 
MITOchondria Project (Scharfe et al., 2000). We also downloaded 
the 199 human and 290 mouse protein sequences annotated at 
LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) as having a
mitochondrial subcellular localization based on gene ontology
terminology (GO:0005739) (January 2003). We also included in our master
list the 13 mtDNA encoded proteins, based on LocusLink annotation.

A Nonredundant List of Mitochondrial Proteins 
FASTA sequences corresponding to the previously annotated mito- 
chondrial proteins, newly identified mitochondrial proteins, and the 
mouse Reference Sequences (August 2003) (Maglott et al., 2000) 
were merged. These were then collapsed into distinct protein clusters
using a command-line version of blastclust
(http://www.ncbi.nlm.nih.gov/BLAST/). We required that members of a cluster demonstrate
70% sequence identity over 50% of the total length, not requiring 
a reciprocal relationship to exist. Clusters containing multiple refer- 
ence sequences were then broken using a higher stringency blast- 



clust, in which we required 90% identity over 50% of the length 
and then manually reviewed. Some clusters were eliminated if they 
consisted of sequences that were annotated as fragments. Each 
protein cluster consists of accessions corresponding to previously 
annotated mitochondrial proteins, as well as accessions of proteins 
identified directly in the proteomics experiments. Hence, some clus- 
ters are supported by proteomics alone or by annotations alone, 
while others have support from both (Figure 1A). Most of these 
clusters also contain a reference sequence, which serves as an 
exemplar representative sequence for that cluster. Some clusters 
did not have reference sequences, so a mouse protein sequence 
was manually identified through iterative NCBI BLAST routines. 

This procedure resulted in a total of 601 protein clusters (Supplemental
Table S1, available online at
http://www.cell.com/cgi/content/full/115/5/629/DC1). Ten clusters consisted of actin, hemoglobin,
keratin, lysozyme, trypsin, or tubulin. These were flagged as ex- 
pected contaminants. While they are included in the list in Supple- 
mental Table S1 , they were eliminated from all subsequent analyses. 
Hence, there are a total of 591 mito-A protein clusters (Figure 1A). 
Of these 591 clusters, 399 were previously annotated as being mito- 
chondrial in LocusLink, in MITOP, or based on the name of the gene. 

Note that the data presented in Supplemental Table S1 comes
from a total of 12 experimental batches, where each batch corresponds
to a single tissue from a single mouse. We performed more
proteomics experiments on mouse liver, and all this data is included 
in our Supplemental Table S1. However, in analyses of mRNA:protein
concordance and in analysis of the compositional diversity, we
limited our analyses to four well matched batches, corresponding 
to mouse brain, heart, kidney, and liver. 

Cell Culture and Transfection 

GFP-tagged proteins were generated for five human homologs of
the identified proteins using the Gateway cloning system (Life Technology)
as described by the manufacturer. Approximately 6 × 10^5
HEK 293 cells were seeded on coverslips in a 6-well plate and
incubated overnight in DMEM supplemented with 10% FBS, 100
U/ml penicillin, and 100 µg/ml streptomycin at 37°C in a humidified
5% carbon dioxide atmosphere. Six microliters of GeneJammer (Stratagene)
in 100 µl DMEM was incubated for 10 min at room temperature,
and 1 µg DNA was added. The mixture was then incubated for a
further 10 min. Nine hundred microliters of DMEM with 10% FBS
and the transfection mixture were combined and added to the cells. 
After 3 hr, 1 ml of DMEM with 10% FBS and antibiotics were added. 
These transfected cells were then incubated for 48 hr. 

Immunofluorescence Microscopy 

Transfected cells were washed with PBS and fixed with 4% paraformaldehyde
in phosphate buffered saline (PBS) for 15 min at room
temperature. Cells were washed three times with 100 mM glycine
in PBS and permeabilized by a three minute incubation in PBS with
0.2% Triton X-100. Then the cells were incubated in 1% BSA to
prevent nonspecific staining. Mitochondria were stained with
α-GRP-75 antibody (Stressgen) diluted 1:200 in 1% BSA in PBS for
one hour. Cells were washed three times with PBS and incubated
with 10 µg/ml of the secondary antibody Alexa Fluor 568 goat anti-mouse
IgM A21043 (Molecular Probes) for 30 min. After three
washes with PBS the coverslips were mounted in anti-fade mounting 
media and the subcellular distribution of these proteins analyzed 
by confocal microscopy. 

RNA/Protein Concordance Test 

We developed the RNA/protein concordance test to determine 
whether there is significant association between protein detection 
in a proteomics experiment and mRNA abundance in a microar- 
ray experiment. 

Consider the pair of tissues, i,j, where i,j ∈ {brain, heart, kidney,
liver}. For a given gene, G, we let M(G,k) represent the gene expression
level of gene G in tissue k. Let P(G,k) be an indicator variable
that is 0 if the protein product of gene G is not found in tissue k,
and 1 if the protein product is found in tissue k. We set

\[
x(i,j) =
\begin{cases}
+1, & \text{if } M(G,i) > M(G,j) \text{ and } P(G,i) > P(G,j) \\
-1, & \text{if } M(G,i) > M(G,j) \text{ and } P(G,i) < P(G,j) \\
\phantom{+}0, & \text{otherwise}
\end{cases}
\]

and define the concordance for gene G, $C_G$, by

\[
C_G = \sum_{i,j} x(i,j)
\]

In the null case in which there is no association between protein
detection and mRNA abundance, the expected concordance for a
gene is 0. The variance in concordance for G, denoted by $v_G$, depends
on the number of tissues, k, in which that gene's product
was detected. If the protein product was detected in k = 0 or 4
tissues, then $C_G$ and $v_G$ are both 0. If the protein product was detected
in exactly one or three tissues, then the possible concordance
measures are +3, +1, 0, -1, or -3. Again, because the expected
concordance is 0, the variance under the null model is simply
$[(+3)^2 + (+1)^2 + (0)^2 + (-1)^2 + (-3)^2]/4 = 5$. Finally, if the protein
product was detected in exactly two tissues, then the possible concordance
measures for the gene are +4, +2, 0, 0, -2, or -4, and
hence, the null variance is $[(+4)^2 + (+2)^2 + (0)^2 + (0)^2 + (-2)^2 +
(-4)^2]/6 = 20/3$. More generally, if the protein was detected in k
tissues out of n that were surveyed, it can be shown that
$v_G = k(n-k)(n+1)/3$.

We compute the observed concordance and null variance for
every gene and sum over all genes. Our test statistic then becomes

\[
Z = \frac{\sum_G C_G}{\sqrt{\sum_G v_G}}
\]

which is approximately normally distributed with mean 0 and variance
1 in the null case where there is no association between RNA
abundance and protein detection.
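
The concordance test described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: the function names and the dictionary-based inputs are assumptions, the sum is taken over ordered tissue pairs, and the statistic is taken to be the summed concordance divided by the square root of the summed null variance. The helper `enumerated_null_variance` checks the closed-form variance by brute force.

```python
from itertools import combinations, permutations
from math import sqrt


def concordance(M, P):
    """Concordance C_G for one gene.

    M maps tissue -> mRNA expression level; P maps tissue -> 0/1 detection.
    """
    c = 0
    for i, j in permutations(M, 2):  # ordered pairs (i, j) with i != j
        if M[i] > M[j]:
            if P[i] > P[j]:
                c += 1   # higher mRNA and protein detected only in i
            elif P[i] < P[j]:
                c -= 1   # higher mRNA but protein detected only in j
    return c


def null_variance(k, n=4):
    """Closed-form null variance k(n - k)(n + 1)/3 for k detections in n tissues."""
    return k * (n - k) * (n + 1) / 3


def enumerated_null_variance(k, n=4):
    """Brute-force check: mean of C_G^2 over all placements of k detections."""
    M = {t: n - t for t in range(n)}  # any set of distinct expression levels
    values = [
        concordance(M, {t: int(t in subset) for t in range(n)})
        for subset in combinations(range(n), k)
    ]
    return sum(v * v for v in values) / len(values)


def z_statistic(genes):
    """genes: iterable of (M, P) pairs; assumes at least one informative gene."""
    total_c = 0
    total_v = 0.0
    for M, P in genes:
        total_c += concordance(M, P)
        total_v += null_variance(sum(P.values()), len(P))
    return total_c / sqrt(total_v)
```

Genes detected in none or in all of the tissues contribute 0 to both sums, so only genes detected in a subset of tissues influence the statistic.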

Compositional Diversity Across Tissues 

Mitochondrial gene products show distinct patterns of expression 
based on protein and RNA expression. These patterns of distribution 
motivate a simple model that describes core mitochondrial proteins 
versus those that are specialized to any set of cell types. Consider
a set of i + 1 tissues, $S_{i+1}$, as well as a distinct subset $S_i$, i.e.,
$S_i \subset S_{i+1}$, where i > 0. We are interested in the probability that a given
gene product is found in $S_{i+1}$, conditional that it is found in $S_i$, or
simply $r(S_{i+1}, S_i)$ = P(gene product is found in $S_{i+1}$ | gene product is
found in $S_i$). We define $P_i$ as the average $r(S_{i+1}, S_i)$ over all selections
of $S_i \subset S_{i+1}$. When we assessed compositional diversity using RNA
expression levels, we interpreted an RNA expression level greater 
than 200 as "expressed" (Su et al., 2002). 

These average conditional probabilities $P_i$ can also be modeled.
Imagine that a fraction f of all mitochondrial proteins are ubiquitous
(i.e., expressed in all cell types with probability 1) and that a fraction
1 - f are not ubiquitous, but rather, appear in a given tissue with
probability p. Then

\[
P_i = \frac{f + (1 - f)\,p^{\,i+1}}{f + (1 - f)\,p^{\,i}}
\]
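
The model and its empirical counterpart can be sketched as follows. This is a minimal illustration under stated assumptions: "found in S" is read as "found in every tissue of S", non-ubiquitous proteins are treated as appearing independently across tissues, and the function names and input layout (a mapping from gene to the set of tissues in which it was detected) are hypothetical.

```python
from itertools import combinations


def model_P(f, p, i):
    """Model value P_i = (f + (1 - f) p^(i+1)) / (f + (1 - f) p^i)."""
    return (f + (1 - f) * p ** (i + 1)) / (f + (1 - f) * p ** i)


def empirical_P(detected, tissues, i):
    """Average conditional probability over all nested selections S_i within S_(i+1).

    detected maps gene -> set of tissues in which its product was found.
    Selections S_i in which no gene is found are skipped.
    """
    ratios = []
    for big in combinations(tissues, i + 1):
        n_big = sum(1 for found in detected.values() if set(big) <= found)
        for small in combinations(big, i):
            n_small = sum(1 for found in detected.values() if set(small) <= found)
            if n_small:
                ratios.append(n_big / n_small)
    return sum(ratios) / len(ratios)
```

Under this model, $P_i$ rises toward 1 as i grows whenever f > 0, because a gene product already seen in many tissues is increasingly likely to be one of the ubiquitous ones.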

DNA Microarray Analysis 

To identify Affymetrix probe-sets corresponding to each protein 
cluster, we mapped the exemplar protein accession to its LocusLink 
ID, then to its Unigene cluster, and then identified the corresponding 
Affymetrix MG-U74Av2 probe set using the NetAffx website
(http://www.affymetrix.com) and its annotation tables (August 2003). Note
that this automated mapping does not guarantee every protein is
mapped to a probe-set ID; the majority of mito-A exemplars could be
mapped to Affymetrix probe sets, but we know of cases in which the automated
procedure failed to provide a corresponding probe-set. Note
that the mapping is largely 1:1, but there are some many:many
mappings.

The GNF mouse expression atlas (Su et al., 2002) was downloaded
from its website (http://expression.gnf.org). In comparisons of protein
detection and mRNA abundance, we used the mRNA expression
level for a given tissue averaged over the replicates, since the GNF 
mouse expression atlas includes duplicates for each tissue. Be- 
cause we performed the proteomic survey on whole brain, we simply 
compared to the average expression of all brain samples in the GNF 
mouse atlas. Hierarchical clustering was performed using DCHIP 
(Schadt et al., 2001), using 1 - r as the distance metric, where r is 
the Pearson correlation coefficient, and the relative expression lev- 
els are displayed. 
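
As an illustration of the distance metric named above, the following sketch computes 1 - r for two expression profiles using only the Python standard library. It is not the DCHIP implementation; the function name is hypothetical, and the profiles are assumed to be equal-length lists of replicate-averaged expression values with nonzero variance.

```python
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of profiles x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 1 - sxy / sqrt(sxx * syy)
```

With this metric, perfectly correlated profiles are at distance 0 and perfectly anticorrelated profiles are at distance 2, regardless of their absolute expression levels.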

Neighborhood analysis was performed using a stand-alone Perl
script that was previously described (Mootha et al., 2003a). We used
the GNF mouse expression atlas for these analyses. Of the 10,043
genes represented in this atlas, 386 correspond to the mito-A genes.
These 386 genes form the target gene set in neighborhood analysis.
For each query gene in the atlas, we rank order all other genes in
the atlas on the basis of Euclidean distance of gene expression.
The neighborhood index, N100, is defined as the number of mito-A
genes within the top 100 ranking genes. If the 386 mito-A genes
were a random subset of the 10,043 genes, then the probability of
detecting at least 15 mito-A genes in a random sample of 100 genes
is 6.7 × 10^-6, corresponding to a Bonferroni corrected p-value (for
the 10,043 measures made) of 0.07.
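
The neighborhood index and its chance expectation can be sketched in Python as follows. This is not the Perl script referred to above; the function names and the dictionary input are hypothetical, and the null model is assumed here to be sampling without replacement (a hypergeometric tail), which may differ in detail from the calculation actually used.

```python
from math import comb, dist


def neighborhood_index(expr, target, query, top=100):
    """Count target-set genes among the `top` nearest neighbors of `query`.

    expr maps gene -> list of expression values (one per sample);
    target is the set of gene names forming the target set (e.g. mito-A).
    """
    q = expr[query]
    # Rank all other genes by Euclidean distance to the query profile.
    ranked = sorted((g for g in expr if g != query), key=lambda g: dist(expr[g], q))
    return sum(1 for g in ranked[:top] if g in target)


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn from N, of which K are in the target set."""
    favorable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(K, n) + 1))
    return favorable / comb(N, n)


# Chance of at least 15 target genes among 100 neighbors, with the
# Bonferroni correction over the 10,043 genes tested.
p_single = hypergeom_tail(10043, 386, 100, 15)
p_corrected = min(1.0, p_single * 10043)
```

Exact integer arithmetic via `math.comb` avoids underflow in the tail; the final division converts to a float only once.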

Identification of Ancestral Mitochondrial Genes 
We downloaded the consensus FASTA sequences for the genes 
represented on the Affymetrix MG-U74Av2 oligonucleotide array 
from the NetAFFX (Liu et al., 2003) website (http://www.affymetrix. 
com). We performed a blastx comparison of these sequences 
against the Rickettsia prowazekii protein sequences, downloaded 
from the NCBI, and then performed a tblastn comparison of the
bacterial protein sequences against the consensus FASTA se- 
quences. In both analyses, default blast parameters were used in 
conjunction with the BLOSUM62 scoring matrix. We defined an 
ancestral gene as one achieving a BLASTX E < 0.01 and having a 
reciprocal best match in the above BLAST analysis. 
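
The reciprocal best match criterion can be illustrated with a short Python sketch. It assumes the BLAST output has already been parsed into dictionaries of (subject, E value) hits; the function name and input layout are hypothetical, and applying the E < 0.01 cutoff only to the blastx direction is one reading of the definition above.

```python
def ancestral_genes(forward, reverse, e_cutoff=0.01):
    """Return mouse genes with a reciprocal best match to a bacterial protein.

    forward maps mouse gene -> list of (bacterial protein, E value) from blastx;
    reverse maps bacterial protein -> list of (mouse gene, E value) from tblastn.
    """
    def best(hits):
        # Best hit = lowest E value; None when there are no hits.
        return min(hits, key=lambda h: h[1])[0] if hits else None

    ancestral = set()
    for gene, hits in forward.items():
        significant = [h for h in hits if h[1] < e_cutoff]
        top = best(significant)
        # Keep the gene only if its best bacterial hit points back to it.
        if top is not None and best(reverse.get(top, [])) == gene:
            ancestral.add(gene)
    return ancestral
```

Requiring the match in both directions guards against assigning several paralogous mouse genes to the same bacterial protein.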
A major advantage of the mouse model lies in the increasing
information on its genome, transcriptome, and proteome, as well 
as in the availability of a fast growing number of targeted and 
induced mutant alleles. However, data from comparative transcrip- 
tome and proteome analyses in this model organism are very 
limited. We use DNA chip-based RNA expression profiling and 2D 
gel electrophoresis, combined with peptide mass fingerprinting of 
liver and kidney, to explore the feasibility of such comprehensive 
gene expression analyses. Although protein analyses mostly iden- 
tify known metabolic enzymes and structural proteins, transcrip- 
tome analyses reveal the differential expression of functionally 
diverse and not yet described genes. The comparative analysis 
suggests correlation between transcriptional and translational 
expression for the majority of genes. Significant exceptions from 
this correlation confirm the complementarities of both approaches. 
Based on RNA expression data from the 200 most differentially 
expressed genes, we identify chromosomal colocalization of 
known, as well as not yet described, gene clusters. The determi- 
nation of 29 such clusters may suggest that coexpression of 
colocalizing genes is probably rather common. 

coexpression and colocalization | comparative expression profiles 

Most biochemical processes within and between cells are put 
into effect by the interaction between proteins, or between 
proteins and their substrates (1-3). The proteome of a cell is the 
result of controlled biosynthesis and, therefore, is largely (but not 
exclusively) regulated by gene expression (4). Vice versa, the 
transcriptome can be regarded as a sensitive read-out of the 
proteome or the biochemical state of the cell. Thus, transcriptome 
and proteome feed back to each other in a highly complex way. The 
understanding of this functional regulation is generally limited to 
distinct signaling or metabolic pathways. To begin to understand the 
mutual regulatory interactions between transcriptome and pro- 
teome, a comparative approach including the simultaneous mon- 
itoring of expression at the RNA and protein levels will be required. 

The basic technologies for genome-wide expression analyses at 
the mRNA (5-7) and protein levels (8-10) are available. Transcript 
profiling was used to assess normal variability in gene expression 
levels of mouse liver, kidney, and testis (11) and to analyze changes 
in expression patterns during embryonic and fetal liver develop- 
ment (12). So far, comparative transcriptome and proteome anal- 
yses in complex organisms are very limited and have been per- 
formed in human platelets (13) and heart tissue (14), and in the 
Anopheles and Culex salivary glands (15, 16). In rodents, the 
proteome of mouse primary islet cells was correlated with RNA 
expression data of purified primary rat beta cells, suggesting a close 
correlation between mRNA and protein expression (17). A parallel 
analysis of transcripts and proteins at a genomic scale in identical 
mouse tissue samples has not been performed. 

We use DNA chip-based expression profiling, 2D gel electro- 
phoresis, and subsequent peptide mass fingerprinting (PMF) to 



explore the general feasibility of such a comparative gene expres- 
sion analysis. A comparison of RNA and protein expression profiles 
from adult male mouse liver and kidney was made. The choice of 
different tissues provided a large set of differentially expressed 
proteins and genes. We used this set of differential expression 
profiles as a tool to address three major questions. (i) Does protein 
expression correlate with transcriptional regulation for the most 
differential proteins? (ii) Do transcriptomics and proteomics ap- 
proaches detect functional categories with different preferences? 
(iii) Does coregulated gene expression correlate with colocalization 
in the genome? 

Materials and Methods 

Mouse Tissues. Wild-type C3HeB/FeJ mice were bred under 
specific pathogen-free conditions. Left kidney and dorsal lobe of 
the liver were collected at the age of 105 days (±5 days) from 
male mice, killed between 9:00 a.m. and 12:00 noon by CO2 
asphyxiation. Organs were immediately frozen in liquid N2. 

Protein Isolation. For pH gradient 4-7, 50 mg of tissue was ground 
in liquid N2. Ten milligrams was dissolved in 200 µl of lysis buffer 
{7 M urea/2 M thiourea/2% DTT/4% CHAPS (3-[(3-cholamidopropyl)dimethylammonio]-1-propanesulfonate)/0.8% 
Pharmalyte 3-10} and sonicated for 10 cycles of 1 s (60 W). The sample 
was kept shaking for 30 min at 25°C and centrifuged for 5 min at 
20,000 × g. Protein concentration was determined by a modified 
Bradford method. Two hundred fifty micrograms of protein was 
loaded onto each 4-7 immobilized pH gradient strip. 

For pH gradient 6-11, 15 mg of the tissue powder was suspended 
in 200 µl of 4°C trichloroacetic acid (TCA)/acetone 20% (vol/vol)/ 
50% (vol/vol). After sonication for 15 min (30 W) in a 4°C water 
bath, the suspension was diluted with 1.2 ml of TCA/acetone 20% 
(vol/vol)/50% (vol/vol), vortexed for 2 min, kept at 4°C for 16 h, and 
centrifuged for 30 min at 20,000 × g (4°C). The pellet was washed 
twice in 200 µl of acetone, sonicated in a 4°C water bath for 20 min, 
centrifuged for 30 min at 20,000 × g (4°C), resuspended in 200 µl 
of lysis buffer, and sonicated 10 times for 1 s (60 W) on ice. Samples 
were kept shaking at room temperature for 45 min and spun down 
for 5 min at 20,000 × g. A modified Bradford protein determination 
was done. Three hundred micrograms of protein was loaded onto 
6-11 immobilized pH gradient strips. 

2D Gel Electrophoresis (2-DE). Isoelectric focusing (IEF) was done 
with 18-cm immobilized pH gradient strips from Amersham 
Pharmacia Bioscience. For each sample, five gels with gradients 
pH 4-7 and 6-11 were made. After focusing to the steady state, 
the strips were loaded with SDS and equilibrated in DTT and 
iodoacetamide (18). 

The second dimension was performed as SDS/PAGE. T12% 
and C2.8% SDS gels were run vertically in a Hoefer ISO-Dalt 
chamber by using the Laemmli buffer system. The SDS/PAGE was 
stopped when the bromophenol blue front ran off the gel; 1,800- 
2,000 Vh were applied. Gels were stained with Sypro Ruby and 
scanned with a Fuji fluorescence scanner. Mastergels were made 
from five replicas of one tissue in each gradient and matched 
(proteomweaver, Definiens AG, Munich). Statistical calculations 
were performed allowing a standard deviation of <30% and 
a confidence level of 0.05 in the t test. Coomassie-stained 
micropreparative gels were run with 500 µg of protein per gel. 

PMF MALDI-TOF. Proteins were identified by PMF MALDI-TOF. 
Spots were picked from SDS gels, washed three times with 10 mM 
NH4HCO3 and 30% acetonitrile (ACN), incubated overnight in 5 µl of 
25 ng/µl trypsin (Roche Diagnostics)/10 mM NH4HCO3 (pH 8) at 
37°C, and sonicated for 20 min at 25°C, and the supernatant was 
concentrated in a SpeedVac. The solution was processed through 
a C18 reversed-phase ZipTip column (Millipore) by using 0.1% 
trifluoroacetic acid (TFA) and 80% ACN for elution. Eluted 
peptides were put on a MALDI target and cocrystallized with 1 µl 
of dihydroxybenzoic acid. MALDI-TOF analysis (Voyager STR, 
Applied Biosystems) was done in reflector mode in the mass range 
of 700 to 4,000 daltons. Spectra were matched with the National 
Center for Biotechnology Information database to identify the 
corresponding protein. 

DNA Chip Expression Profiling. Total RNA was isolated according to 
manufacturer's protocols by using RNeasy kits (Qiagen, Hilden, 
Germany). The concentration of total RNA was measured by 
OD260/280 reading. Per DNA chip, 20 µg of total RNA was used for 
reverse transcription and indirectly labeled with Cy3 or Cy5 fluo- 
rescent dyes according to The Institute for Genomic Research 
(TIGR) protocol as described (19). Probes were PCR amplified 
from the 20,000 (20k) mouse arrayTAG clone set as described (20). 
Amplified probes were dissolved in 3× SSC and spotted on 
aldehyde-coated slides (CEL Associates, Pearland, TX) by using 
the Microgrid TAS II spotter (Biorobotics Genomic Solutions, 
Huntingdon, U.K.) with Stealth SMP3 pins (Telechem, Sunnyvale, 
CA). Spotted slides were rehydrated, blocked, denatured, and dried 
as described (19, 21). The hybridization mixture was placed on 
prehybridized microarrays and hybridized at 42°C for 18-20 h. 
Microarrays were immersed in 40 ml of 3× SSC and then successively 
washed in 40 ml of 1× SSC, 40 ml of 1× SSC/0.1% SDS, and 
40 ml of 0.1× SSC at room temperature as described (19). Dried 
slides were scanned with a GenePix 4000A scanner and analyzed by 
using the GENEPIX pro 3.0 image processing software (Axon In- 
struments, Union City, CA). Gene expression data have been 
submitted to the Gene Expression Omnibus database. 

Simulation of Gene Distribution. The probability of obtaining clusters 
was first simulated by generating random distributions of spots 
along the genome. The second simulation is based on the random 
selection of genes from the published mouse genome sequence. To 
reduce redundancy in this list (University of California, Santa Cruz 
(UCSC) Genome Browser, known genes track), we filtered out those 
genes that start before the middle of a previous gene. The upper 
confidence bounds were derived under the standard binomial 
model from the number of successful simulations. The procedure 
was written in the statistical language R and is available under 
http://ibb.gsf.de/homepage/volkmar.liebscher/genom/mousesim.html. 
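
The first simulation can be illustrated with a minimal Python sketch. This is not the authors' R procedure; it assumes a single linear genome of 2.5 Gb (chromosome boundaries are ignored), uniformly placed points, and a greedy left-to-right count of nonoverlapping windows. The function names and the seed are hypothetical.

```python
import random

GENOME_SIZE = 2_500_000_000  # 2.5 Gb, the genome size quoted in the text


def count_windows(positions, window, min_genes):
    """Greedily count nonoverlapping windows holding >= min_genes points."""
    pts = sorted(positions)
    count = 0
    i = 0
    while i < len(pts):
        # Advance j to the first point outside a window starting at pts[i].
        j = i
        while j < len(pts) and pts[j] - pts[i] < window:
            j += 1
        if j - i >= min_genes:
            count += 1
            i = j  # skip every point consumed by this window
        else:
            i += 1
    return count


def simulate(n_runs=10_000, n_genes=200, window=1_000_000,
             min_genes=2, min_windows=29, seed=0):
    """Return how many random runs reach at least min_windows clusters."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        positions = [rng.randrange(GENOME_SIZE) for _ in range(n_genes)]
        if count_windows(positions, window, min_genes) >= min_windows:
            hits += 1
    return hits
```

Calling `simulate()` with the defaults mirrors the 1-Mb, two-gene setting; `simulate(window=2_000_000, min_genes=4, min_windows=4)` mirrors the larger clusters. The returned count of successful runs is the input for a binomial confidence bound.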



Results 

Differential Transcriptome of Mouse Liver and Kidney. To analyze the 
differential transcriptome of mouse liver and kidney, RNA expres- 
sion profiling was performed with cDNA microarrays (22) con- 
taining a sequence-verified 20,200 mouse clone set (19, 21). Sixteen 
dual-color DNA chip hybridizations of cDNAs from age-matched 
C3HeB/FeJ male mice were made (Table 1, chips 1 a-f, 2 a-f, and 
3 a-d, which is published as supporting information on the PNAS 
web site). For each individual mouse, six or four replicate hybrid- 
izations were done. Between 16,092 and 19,592 probes had detect- 
able signals in individual chips, and 9,042 probes had signals in all 
microarrays (Table 1). 

We first analyzed the significance of genes with signals on all 16 
microarrays. Based on expectations from random permutations of 
genes and expression ratios, the selection of the top 1,802 differ- 
entially expressed genes would include one or more reproducibly 
regulated (false positive) genes by chance with P < 0.01 (Table 1). 
Of the 1,802 differentially expressed genes, 821 were more abun- 
dant in liver than in kidney, and 981 genes were more abundant in 
kidney as compared with liver (fully listed in Table 2, which is 
published as supporting information on the PNAS web site). In 
addition, genes were ranked based on the lowest absolute signal 
intensity ratio in 16-chip hybridizations regardless of reproducibil- 
ity. Although this criterion does not a priori select constant gene 
expression patterns, no nonreproducible gene expression, in terms 
of inconsistencies in up- or down-regulation in 16 repetitions, was 
found within the 470 strongest differentially expressed genes (Table 
3, which is published as supporting information on the PNAS web 
site). Based on statistics, we expect this selection to contain one or 
more nondifferentially expressed (NDE) genes with a significance 
level of P < 0.01. The numbers of actual nonreproducible genes and 
expected NDE (false positive) genes for P < 0.01 are given for 
different gene selections in Table 3, confirming the reliability of the 
data. Additional confidence in the gene expression data is gained 
from the fact that independent probes for the same gene result in 
similar expression ratios (see, for example, probes for Cai, Scp2, 
Mup, Car3, Arg-1, and Akr1a4 in Table 4, which is published as 
supporting information on the PNAS web site). Also, the specificity 
of the probes used on our DNA chip was recently assessed 
experimentally (20). 
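
As an illustration of the second ranking criterion, the following Python sketch ranks genes by their lowest absolute ratio across chips and flags inconsistent directions of regulation. It assumes the ratios are already expressed as log2(liver/kidney) values with a common orientation; the function names and the example values are hypothetical and are not study data.

```python
def min_abs_log_ratio(log_ratios):
    """Smallest absolute log2 ratio seen for one gene across all chips."""
    return min(abs(r) for r in log_ratios)


def is_reproducible(log_ratios):
    """True when every chip shows the same direction of regulation."""
    return all(r > 0 for r in log_ratios) or all(r < 0 for r in log_ratios)


def rank_genes(ratios_by_gene):
    """Rank genes by their weakest ratio, strongest first."""
    ranked = sorted(ratios_by_gene,
                    key=lambda g: min_abs_log_ratio(ratios_by_gene[g]),
                    reverse=True)
    return [(g, is_reproducible(ratios_by_gene[g])) for g in ranked]


# Hypothetical log2(liver/kidney) ratios on three chips, not study data.
example = {
    "gene_a": [3.1, 2.8, 3.4],
    "gene_b": [-1.2, -0.9, -1.5],
    "gene_c": [0.4, -0.3, 0.2],
}
```

Ranking by the weakest ratio does not itself enforce a consistent sign, which is why the reproducibility flag is computed separately.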

Colocalization of Differentially Expressed Genes. Analyzing the chromosomal 
localization, we found that the orthologues of the human 
proximal SERPIN subcluster were reproducibly coexpressed (23). 
The genes Serpina10 (rank no. 60 in Table 2), Serpinad (rank no. 
110), Serpina1b (rank no. 31), Serpina1d (rank no. 61), Serpina1a 
(rank no. 75), and Serpina1e (rank no. 53) were strongly expressed 
in liver but not in kidney. 

To identify other potential clusters of coregulated genes, we 
systematically analyzed the chromosomal localization of the top 200 
differentially expressed genes (Fig. 1). The localization was determined 
by aligning the probe sequences against the October 2003 
assembly of the mouse genome by using mouse BLAT on the UCSC 
Genome Browser (24). 

We identified 25 genomic regions containing two or three 
coexpressed genes within <1 Mb (numbered 1 to 25 in Fig. 1) and 
four regions with at least four coregulated genes within <2 Mb 
(labeled A to D in Fig. 1). Using in silico simulations (n = 10,000) 
of the random distribution of 200 points ("genes") in 2.5 Gb, the size 
of the mouse genome (25), we derived upper 95% confidence 
bounds of P < 0.0005 for the probability of obtaining by chance at 
least 29 regions of 1 Mb with at least two genes, and of P < 0.0005 
for obtaining four or more regions of 2 Mb with at least four genes. This 
simulation includes some simplifications, such as neglecting the size 
of a gene in relation to the genome and assuming an equal 
distribution of genes along the genome. Thus, we ran a second 
simulation that is based on the published mouse genome sequence 
Fig. 1. Chromosomal localization of the top 200 differentially expressed genes based on DNA chip expression data. The top 100 genes relatively more abundant in 
liver are shown in red, and the top 100 genes with higher expression in kidney than in liver are shown in green. Genomic regions with two or three regulated genes 
within <1 Mb (numbered 1 to 25) and with four or five regulated genes within <2 Mb (labeled A to D) are shown to the left of each chromosome. Genes not colocalizing 
are shown to the right of each chromosome. BLAT searches of 14 probe sequences did not match a specific sequence in the public mouse genome database. 



and annotation. For this simulation, we analyzed 10,000 distribu- 
tions of 200 randomly selected genes from the list of all known genes 
in the Mouse May 2004 Assembly (26, 27). Over these runs, we 
recorded the frequency with which we found at least 29 (respectively 4) 
nonoverlapping windows of 1 Mb (respectively 2 Mb) containing at 
least two (respectively 4) genes. The 95% confidence bounds show 
that the colocalization is significant because the probability of finding 
29 small clusters (1 Mb, at least two genes) or four larger 
clusters (2 Mb, at least four genes) by chance is P < 0.03 or P < 0.007, 
respectively. 
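
The upper confidence bounds quoted for the simulations follow from the number of successful runs under a binomial model. The Python sketch below computes a one-sided exact (Clopper-Pearson) upper bound by bisection; the choice of this particular interval and the function names are assumptions, since the text only states that a standard binomial model was used. As a plausibility check, zero successes in 10,000 runs would give a bound of about 0.0003, which would be consistent with a reported P < 0.0005.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def upper_bound(successes, n_runs, alpha=0.05):
    """One-sided upper bound: the p at which P(X <= successes) falls to alpha."""
    if successes >= n_runs:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(100):  # bisection; the CDF decreases as p grows
        mid = (lo + hi) / 2
        if binom_cdf(successes, n_runs, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi
```

For example, `upper_bound(0, 10_000)` evaluates the zero-success case mentioned above; larger bounds such as those reported for the second simulation would correspond to a nonzero number of successful runs.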

Genes of at least 10 clusters are paralogous genes of eight 
gene families: Carbonic anhydrases (Car, cluster 2), Fibrinogens 
(Fg, cluster 3), Apolipoproteins (Apo, clusters 6 and 13), Cytochrome 
P450 family 2 (Cyp2, clusters 7 and 20), Kallikreins (Klk, 
cluster 8), Serine protease inhibitors (Serpin, cluster D), Inter-alpha 
trypsin inhibitors (Itih, cluster 18), and Solute carriers 
(Slc, cluster 24). 

To characterize some clusters of coexpressed genes in more 
detail, we included in our analysis genes in the intergenic regions as 



well as genes flanking clusters. For example, the Apolipoprotein 
cluster on mouse chromosome 7 has an evolutionarily conserved 
arrangement in mouse and man (28, 29). In the mouse, Apoe, 
Apoc1, Apoc4, and Apoc2 are localized within an interval of ~20 kb 
with the same transcriptional orientation. Genes of this Apolipoprotein 
cluster are expressed more strongly in liver than in kidney: Apoe is 
represented by two probes on our DNA chip (rank nos. 72 and 25, 
Table 2), and Apoc1, Apoc4, and Apoc2 are each represented by one 
probe (rank nos. 2, 225, and 104, Table 2). The downstream and 
upstream flanking genes of this Apolipoprotein cluster, Tomm40 
and Clptm1, are also represented each by one probe (data not 
shown). However, they are not differentially regulated between 
liver and kidney, suggesting that the regulation is confined exclu- 
sively to genes of the Apolipoprotein cluster. 

Differential Proteome of Mouse Liver and Kidney. Samples from 
mouse 1 (Tables 1 and 3) were divided such that RNA and protein 
data were obtained from the identical sample. A total of 2,445 spots 
were detected in the liver proteome compared with 2,261 spots in 
Fig. 2. Mastergels of mouse liver and kidney protein extracts (A) and example for the identification of a differential protein signal (B). (A) Each image is a digital 
mastergel of five 2D gels of identical protein extracts. Proteins more abundant in either liver or kidney (red or green, respectively) were excised for PMF. Subscripts 
(3 1/2, 10 1/2, 24 1/2, 25 1/2, 34 1/2, 44 1/2, and 45 1/2) indicate spots that contained two distinct proteins based on PMF. (B) A visualization of the spot (no. 37) quantification 
in liver (Upper) and kidney (Lower). This protein (Nudt7) had a spot intensity of 254 in liver and was not detected in 2D gels from kidney extracts. 



the kidney proteome (Fig. 2A). To detect differential protein 
expression, the quantification of each spot was compared between 
both organs. With a factor of at least 1.5-fold, 366 spots were more 
abundant in kidney as compared with liver, and 439 spots were 
more abundant in liver than in kidney. 

For subsequent PMF, spots were selected based on stringent 
criteria allowing a standard deviation of <30% in five replicates and 
a confidence level of P < 0.05 in the t test (Fig. 2). Based on these 
criteria, 47 differential spots were selected for protein identifica- 
tion. Seven spots consisted each of two distinct proteins (Fig. 2A). 
Mass fingerprinting of three spots did not lead to the identification 
of known proteins, resulting in 51 independent protein identifica- 
tions (33 in liver and 18 in kidney). 
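
The selection criteria can be sketched in Python as follows. This is an illustration only: it assumes five replicate intensities per tissue, reads "standard deviation of <30%" as a relative standard deviation, combines it with the 1.5-fold threshold, and uses an unpaired pooled-variance t test with 8 degrees of freedom (two-sided 5% critical value 2.306). The exact test used by the matching software is not stated in the text, and the function names are hypothetical.

```python
import math
import statistics

T_CRIT_DF8 = 2.306  # two-sided 5% critical value of Student's t, 8 d.f.


def relative_sd(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def is_differential(liver, kidney, max_rsd=0.30, min_fold=1.5):
    """Apply the three filters to five replicate intensities per tissue."""
    if len(liver) != 5 or len(kidney) != 5:
        raise ValueError("sketch assumes five replicates per tissue")
    m1, m2 = statistics.mean(liver), statistics.mean(kidney)
    if min(m1, m2) <= 0:
        # Spots absent from one tissue are not handled in this sketch.
        raise ValueError("both tissues need a positive mean intensity")
    if relative_sd(liver) >= max_rsd or relative_sd(kidney) >= max_rsd:
        return False
    if max(m1, m2) / min(m1, m2) < min_fold:
        return False
    # Pooled-variance two-sample t statistic for equal group sizes.
    pooled = (statistics.variance(liver) + statistics.variance(kidney)) / 2
    if pooled == 0:
        return True  # identical replicates with different means
    t = abs(m1 - m2) / math.sqrt(pooled * (2 / 5))
    return t > T_CRIT_DF8
```

A spot detected in only one tissue, such as the example shown in Fig. 2B, would need separate handling because no ratio or relative standard deviation can be formed for the tissue in which it is absent.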

Six proteins (Krt1-18, Cps1, Mup1, Car3, Vil, and Akr1a4) were 
identified within either two or three individual spots, suggesting that 
these major differential proteins are present in different isoforms or 
with different posttranslational modifications. The 51 identified 
proteins thus represent 43 distinct proteins (fully listed in Table 4). 

Many of the major differentially expressed proteins are characteristic 
markers for the tissues analyzed. Villin (Vil) is a structural 
protein localized in the microvilli of brush borders of proximal 
kidney tubules (30, 31), and aldehyde reductase (Akr1a4), with 
previously reported strongest transcription in kidney, is functionally 
involved in the detoxification of reactive aldehyde intermediates 
(32, 33). In liver, expression of the intermediate filament Keratin 18 
(Krt1-18, keratin-type I-cytoskeletal) was previously described in 
epithelia, and mutations in Keratin 18 have been identified as risk 
factors for developing liver disease of multiple etiologies (34, 35). 
The hepatocyte-restricted expression of carbamoyl phosphate synthetase 
I (Cps1) restricts the urea cycle to liver (36, 37). Carbonic 
anhydrase 3 (Car3) and major urinary protein 1 (Mup1) are known 
for their expression and physiological function in liver (38-40). The 
finding that a considerable number of the identified proteins are 



characteristic markers for the examined tissues gives confidence in 
the differential protein data. 

Assessment of Transcript and Protein Functions. To compare the 
functions of differential transcripts and proteins, we collected the 
functional annotations (biological process and molecular function) 
of all identified proteins and the top 100 differentially expressed 
genes in the Mouse Genome Informatics database (Fig. 3 and Table 
4). More than 70% of the identified proteins were annotated as 
metabolic enzymes or associated with biosynthesis. The majority of 
Fig. 3. Pie charts of functional categories for genes more abundant in liver or 
kidney at the protein or transcript level. Data are based on the top 100 differen- 
tially expressed transcripts and the 43 identified proteins. Standardized Gene 
Ontology (GO) classifications were extracted from the Mouse Genome Informat- 
ics database. 
the remaining functionally annotated proteins were either transport 
(11% in liver and 6% in kidney) or structural (7% in liver and 6% 
in kidney) proteins (Fig. 3). The functional categories among 
transcripts were less dominated by metabolic enzymes (12% in liver 
and 14% in kidney) and comprised more diverse functional anno- 
tations. The latter included genes coding for proteases and protease 
inhibitors and proteins associated with apoptosis. In particular, 14% 
of the transcripts relatively abundant in liver had various unique 
functions. Among differential transcripts, 22% in liver and 36% in 
kidney were from genes without a functional annotation in the 
Mouse Genome Informatics database. In contrast, among the 
identified proteins, 8% in liver and 0% in kidney had no functional 
annotation (Fig. 3). Because the functional annotations for both 
proteins and transcripts were derived from the same database and 
because the distribution of categories of functions is different for 
proteins and transcripts, the distribution of functional categories 
does not merely reflect the makeup of the database. Instead it is 
conceivable that the differences in the relative abundance of 
functional categories are due to different preferences of the pro- 
teomics and transcriptomics methods (see also Discussion). 

Transcriptional Versus Translational Regulation. We assessed the 
differential expression at the RNA level of those proteins that were 
identified as relatively abundant in either organ. Of the 43 identified 
proteins, 37 were represented by at least one probe on our microarray 
(Table 4). Nine genes were represented by two (Cai, Scp2, 
Actr3, Car3, Arg-1, Hmgcs2, and Akr1a4), three (Acox1), or four 
(Mup gene family) probes (Table 4). 

In liver, 18 of the 24 proteins (75%) for which DNA probes were 
present were also significantly more abundant at the transcript level 
in liver as compared with kidney. In addition, for three genes 
(Rad23b, Krt1-18, and Gpt1), DNA chip experiments suggested 
reproducible up-regulation on all slides on which spots could be 
identified. For two genes (Rnf20 and Actr3) that were highly 
expressed at the protein level, RNA expression profiling did not 
indicate differential regulation (Table 4). For the unknown gene, 
gb|BC026366, only 2 of 16 chips resulted in hybridization signals, not 
allowing assessment of the transcriptional regulation. 

Thirteen of the 18 proteins relatively abundant in kidney were 
represented by at least one probe on the DNA chip. Five of these 
genes (38%) were also significantly more abundant at the RNA 
level in kidney as compared with liver (Atp6v1b2, Arbp, Akr1a4, 
Oxct, and Tpi). In addition, transcripts for two genes (Vil and Ldh1) 
were more abundant in kidney as compared with liver but were 
either not detected on all DNA chips or were reversely regulated on 
one of 16 chips (Table 4). Acox1 had a tendency to be more strongly 
expressed in kidney based on DNA chip data. Fumarate hydratase 
1 (Fh1) and fumarylacetoacetate hydrolase (Fah), both major proteins 
detected in kidney, were strongly transcribed in liver but not 
in kidney, indicating reverse regulation on the transcript and 
protein levels (Table 4). DNA chip expression profiling did not 
suggest differential regulation of the remaining three genes (Mtx2, 
Ak1, and Dnahc11). 

Taken together, of the 37 proteins that were also represented by 
a probe on the microarray, 29 genes (79%) were either clearly 
regulated with same tendency on all chips (18 in liver and 5 in 
kidney) or on most chips (3 in liver and 3 in kidney). There was 
evidence for no transcriptional regulation of five genes and for 
reverse regulation of two differentially expressed proteins (Fig. 4). 
DNA chip data did not allow assessment of transcript regulation for 
one differentially expressed protein, possibly due to low gene 
expression levels. Although transcriptional and translational regu- 
lation correlate positively for the majority of genes, the comparative 
approach also demonstrates that some proteins are either tran- 
scriptionally not differentially regulated or show a reverse tran- 
scriptional regulation. 
Fig. 4. Transcript expression of differential proteins. Twenty-nine genes had 
the same tendency at the transcript level (orange), five were not differentially 
regulated at the transcript level (pink), two were reversely regulated (light 
blue), and one was weakly expressed at the RNA level (dark blue). 

Discussion 

Using DNA chip-based expression profiling with >20,200 probes, 
we identified >1,800 transcripts differentially regulated with high 
statistical significance between mouse liver and kidney. 2D gel 
electrophoresis detected around 2,300 spots in each organ. About 
800 spots were regulated with a factor of at least 1.5-fold. PMF of 
47 isolated spots resulted in the identification of 43 distinct differ- 
ential proteins. We used this rather comprehensive gene expression 
data set as a tool to (i) evaluate functions of differential transcripts 
and proteins, (ii) relate transcriptional and posttranscriptional 
regulation, and (iii) map differential transcripts to the mouse 
genome. 

The comparison of the functional annotation of the major 
differential proteins and transcripts suggests that protein and 
transcript detection methods reveal functional categories with 
different preference. Metabolic enzymes constitute the largest 
fraction of identified proteins. A minor fraction is associated with 
other functions such as transport or structure. These observations 
corroborate similar findings made, for example, in the analysis of 
the mouse brain proteome (9, 10). In contrast, differential tran- 
scripts have more diverse functions (Fig. 3). On one hand, the 
relatively low number of diverse functional groups at the protein 
level may be due to current limitations of the proteome analysis 
method. We estimate the detection limit of the proteomics approach 
to be at least 1,000 copies of a protein per cell. The proteins 
detected by 2D gel electrophoresis represent the most abundant 
proteins. In addition, we selected the most differential spots for 
protein identification. This experimental limitation is probably one 
important reason why the detected proteins mostly have metabolic 
functions. Thus, regarding differences in protein expression, a 
major distinction between liver and kidney cells seems to be the set 
of metabolic enzymes activated in the respective tissue. The better 
sensitivity of DNA chip expression profiling may be one reason why 
the differential transcripts have more diverse functions. The latter 
included 22% (liver) and 36% (kidney) novel genes and genes 
without functional annotation. Thus, DNA chip-based transcrip- 
tome analysis may also be an efficient method for the identification 
of novel disease-associated genes (41, 42). 

The comparative approach opened the possibility to relate 
regulation at the transcript and posttranscriptional levels. In our 
experimental set-up, we can easily analyze the expression at the 
transcript level of differentially expressed proteins because all 
probes on our DNA chip have been sequenced. The reverse, finding 
the corresponding protein for a differential transcript on the 2D gel, 
would require specific antibodies or a systematic PMF analysis of all 
spots on 2D gels. The majority of the differential proteins were also 
regulated with the same tendency in DNA chip analyses (Fig. 4). 
This observation suggests that, at least for the most differential 
proteins, gene expression at the transcript level correlates well with 
protein expression. Similarly, a close correlation between mRNA 
and protein expression was suggested in rodent pancreatic islet 
cells (17) and mitochondria from distinct mouse tissues (43). Five 
differential proteins (Rnf20, Actr3, Mtx2, Ak1, and Dnahc11) were 
not regulated at the RNA level, suggesting that the differential 
expression of these proteins could be due to the stability or 
differences in secretion or accumulation of these proteins. Moreover, 
Fh1 and Fah, both abundant proteins in kidney, were strongly 
transcribed in liver but not in kidney, possibly suggesting different 
turnover rates or efficiencies of translation in the two tissues. The 
comparison of gene regulation at the transcript and protein levels 
thus provides a proof-of-principle for the usefulness of the com- 
parative approach. 

Our transcriptome analysis of two functionally diverse tissues led 
to the identification of >1,000 differentially expressed genes. This 
high number of regulated genes allowed the assessment of chro- 
mosomal colocalizations, resulting in the description of 29 clusters 
of coexpressed genes. Chromosomal regions of coexpressed genes 
have also been identified based on expression profiling data in yeast, 
Caenorhabditis elegans, Drosophila, man, and mouse (44-48). The 
coregulation of closely linked genes through shared sequence 
elements in cis (such as enhancers, repressors, insulators, locus 
control and matrix attachment regions, etc.) has been described for 
gene families such as apoE, α-globin, β-globin, Hox genes, and 
others (49-52). Similarly, our expression data identified the prox- 
imal Serpin subcluster as linked and differentially regulated genes. 
The arrangement of these genes is conserved between mouse and 
man, except that the human SERPINA1 gene has five isoforms in 
mouse (53). Recently, a control region was identified in the human 
locus that is required for SERPIN gene activation and for chromatin 
remodeling of the proximal subcluster (54). 

The coregulation of linked genes may be imposed either by 
sharing cis-regulatory interactions or, alternatively, may be associ- 
ated with a more general or long-range property of genomic 
sequences (49, 55, 56). Additional data suggest that at least some of 
the genes, identified here as coexpressed, may indeed be coregu- 
lated through the same regulatory factors. For example, the ex- 
pression of the Serpin and the Fetuin clusters in liver may at least 
in part require the same transcription factors. Human HNF3 (Foxa3 
in mouse; rank no. 167 in our liver expression data; Table 2) is an 
essential factor for the transcriptional regulation of many hepatic 
genes that can affect chromatin structure by displacing linker 
histones at least in the serum albumin enhancer. It was also 
suggested to be one of the potential factors regulating expression of 
SERPIN genes (54, 57, 58). HNF3 binding sites were also identified 
in the liver-specific FETUIN (AHSG) gene. In the mouse, Ahsg and 
Fetub are direct neighboring genes within ~50 kb on chromosome 
16 (Chr. 3 in man) (59). Both genes, Ahsg (two probes, rank nos. 9 
and 11) and Fetub (rank no. 636), were strongly expressed in liver 
and weakly expressed in kidney (Table 2). Based on these obser- 
vations, we hypothesize that the colocalization of coexpressed genes 
in our study may at least in part be of functional relevance. 

The clusters of coexpressed genes identified here provide a basis 
for the identification of common regulatory sequences. They are 
currently being analyzed systematically in a combination of in silico 
gene- and region-wise, intra- and interspecies comparative approaches. 
Predictions on regulatory sequences must be followed by functional 
mutagenesis studies in vivo. 